DO i = 1, n
v(i) = v(i) + X * w(i)
ENDDOOn the R8000 architecture, this can be coded as two load instructions followed by a madd instruction and a store. See Figure 6-1.
Figure 6-1 : A Simple DAXPY Implementation This simplest schedule achieves the desired result after five cycles. Since the R8000 architecture can allow up to two memory operations and two floating point operations in the same cycle, this simple example uses only 1/10th of the R8000's peak megaflops. There is also a delay of three cycles before the results of the madd can be stored.
0: ldc1 ldc1 madd 1: 2: 3: 4: sdc1A loop unrolling by four schedule improves the performance to 1/4 th of the R8000's peak megaflops.
0: ldc1 ldc1 madd 1: ldc1 ldc1 madd 2: ldc1 ldc1 madd 3: ldc1 ldc1 madd 4: sdc1 5: sdc1 6: sdc1 7: sdc1But this schedule does not take advantage of the R8000's ability to do two stores in one cycle.The best schedule that could be achieved would look like the following:
0: ldc1 ldc1 1: ldc1 ldc1 madd madd 2: sdc1 sdc1It uses 1/3 of the R8000's peak megaflops. But there still is a problem with the madd sdc1 interlock delay. Software pipelining addresses this problem.
Software pipelining allows you to mix operations from different loop iterations in each iteration of the hardware loop. Thus the store instructions (which would interlock) in the above example could be used to store the results of different iterations. This can look like the following:
L1: 1: t1 = ldc1 t2 = ldc1 t3 = madd t1 X t2 2: t4 = ldc1 t5 = ldc1 t6 = madd t4 X t5 3: sdc1 t7 sdc1 t8 beq DONE 4: t1 = ldc1 t2 = ldc1 t7 = madd t1 X t2 5: t4 = ldc1 t5 = ldc1 t8 = madd t4 X t5 6: sdc1 t3 sdc1 t6 bne L1 DONE:The stores in this loop are storing the madd results from previous iterations. But, in general, you could mix any operations from any number of different iterations. Also, note that every loop replication completes two loop iterations in 3 cycles.
In order to properly prepare for entry into such a loop, a windup section of code is added. The windup section sets up registers for the first stores in the main loop. In order to exit the loop properly, a winddown section is added. The winddown section performs the final stores. Any preparation of registers needed for the windown is done in the compensation section. The winddown section also prevents speculative operations.
windup: 1: t1 = ldc1 t2 = ldc1 t7 = madd t1 X t2 2: t4 = ldc1 t5 = ldc1 t8 = madd t4 X t5 L1: 1: t1 = ldc1 t2 = ldc1 t3 = madd t1 X t2 2: t4 = ldc1 t5 = ldc1 t6 = madd t4 X t5 3: sdc1 t7 sdc1 t8 beq compensation1 4: t1 = ldc1 t2 = ldc1 t7 = madd t1 X t2 5: t4 = ldc1 t5 = ldc1 t8 = madd t4 X t5 6: sdc1 t3 sdc1 t6 bne L1 winddown: 1: sdc1 t7 sdc1 t8 br ALLDONE compensation1: 1: t7 = t3 t8 = t6 2: br winddown ALLDONE:Our example loop always does loads from at least 4 iterations, so we don't want to start it if we don't want speculative operations and if the trip count is less than 4. The following generalizes our example into a map of a software pipelined loop:
/* Precondition for unroll by 2 */ do i = 1, n mod 2 original loop enddo if ( n-i < 4 ) goto simple_loop windup: ... /* fp - fence post */ fp = fp - peel_amount ... swp replication 0: ... if ( i == fp ) goto compensation 0 ... swp replication n - 1: ... if ( i != fp ) goto swp replication 0 compensation n-1: winddown: ... goto done compensation 0: /* Move registers to set up winddown */ rx = ry ... goto winddown ... compensation n - 2: ... /* Move registers to set up winddown */ rx = ry goto winddown simple_loop: do i = i,n original_loop enddo done:In practice, software pipelining can make a huge difference. Sample benchmark performances see more than 100% improvement when compiled with software pipelining. This makes it clear that in order to get the best performance on the R8000 (-mips4), you should compile your application to use software pipelining (-O3).